Python Machine Learning 2nd Edition by Sebastian Raschka, Packt Publishing Ltd. 2017
Code Repository: https://github.com/rasbt/python-machine-learning-book-2nd-edition
Code License: MIT License
Note that the optional watermark extension is a small IPython notebook plugin that I developed to make the code reproducible. You can just skip the following line(s).
In [ ]:
%load_ext watermark
%watermark -a "Sebastian Raschka" -u -d -p numpy,pandas,matplotlib,sklearn
The use of watermark is optional. You can install this IPython extension via "pip install watermark". For more information, please see: https://github.com/rasbt/watermark.
In [ ]:
# Use the IPython/jupyter feature to show images inline with the notebook
# output rather than have images popup.
from IPython.display import Image
%matplotlib inline
In [ ]:
# Sample csv
import pandas as pd
from io import StringIO
import sys
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
# If you are using Python 2.7, you need
# to convert the string to unicode:
if (sys.version_info < (3, 0)):
    csv_data = unicode(csv_data)
df = pd.read_csv(StringIO(csv_data))
df
In [ ]:
# Give a count of null values for each column
df.isnull().sum()
In [ ]:
# access the underlying NumPy array
# via the `values` attribute
df.values
In [ ]:
# remove rows that contain missing values
df.dropna(axis=0)
In [ ]:
# remove columns that contain missing values
df.dropna(axis=1)
In [ ]:
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
,,,
4.0,,,
4.0,6.0,,
5.0,6.0,,8.0
10.0,11.0,12.0,'''
# If you are using Python 2.7, you need
# to convert the string to unicode:
if (sys.version_info < (3, 0)):
    csv_data = unicode(csv_data)
df2 = pd.read_csv(StringIO(csv_data))
df2
In [ ]:
# only drop rows where all columns are NaN
df2.dropna(how='all')
In [ ]:
# drop rows that have fewer than 3 non-NaN values
df2.dropna(thresh=3)
In [ ]:
# only drop rows where NaN appear in specific columns (here: 'C')
df2.dropna(subset=['C'])
In [ ]:
# again: our original array
df.values
In [ ]:
# impute missing values via the column mean
from sklearn.preprocessing import Imputer
imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data
In [ ]:
# impute missing values via the row mean
imr = Imputer(missing_values='NaN', strategy='mean', axis=1)
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data
Documentation for sklearn.preprocessing.Imputer
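In scikit-learn 0.22 and later the Imputer class is no longer available; sklearn.impute.SimpleImputer covers the column-mean case. As a minimal sketch of what column-mean imputation computes, the same result can be reproduced with plain NumPy. The array repeats the sample CSV data from above, and the variable names are illustrative only.

```python
import numpy as np

# same values as the sample CSV above; NaN marks the missing entries
data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [5.0, 6.0, np.nan, 8.0],
                 [10.0, 11.0, 12.0, np.nan]])

# mean of each column, ignoring NaN entries
col_means = np.nanmean(data, axis=0)

# replace each NaN with the mean of its column
imputed = np.where(np.isnan(data), col_means, data)
print(imputed)
```

The missing value in column C becomes the mean of 3.0 and 12.0, and the missing value in column D becomes the mean of 4.0 and 8.0.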
In [ ]:
Image(filename='images/04_01.png', width=400)
Here the fit and transform methods are used for preprocessing, for example with the MinMaxScaler, to map data from its original form to one better suited for machine learning.
The transformer fitted on the training data is later used to transform the test data, and then any new data when it comes along.
As we have seen, it is possible to either remove missing values or impute them. One thing to bear in mind is that the data passed to transform must have the same number of features (columns) as the data passed to fit.
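The fit/transform pattern can be sketched with a tiny hand-written scaler. TinyMinMaxScaler and the demo arrays are hypothetical names made up for this illustration; this is not scikit-learn's implementation, only a stand-in that shows where the parameters are learned and where they are reused.

```python
import numpy as np


class TinyMinMaxScaler:
    """Hypothetical stand-in that mimics the fit/transform pattern."""

    def fit(self, X):
        # learn the per-column minimum and range from the training data only
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X):
        # reuse the parameters learned in fit; the column count must match
        if X.shape[1] != self.min_.shape[0]:
            raise ValueError('expected %d columns, got %d'
                             % (self.min_.shape[0], X.shape[1]))
        return (X - self.min_) / self.range_


X_train_demo = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test_demo = np.array([[2.0, 20.0], [6.0, 40.0]])

scaler = TinyMinMaxScaler().fit(X_train_demo)
train_scaled = scaler.transform(X_train_demo)
test_scaled = scaler.transform(X_test_demo)
print(train_scaled)
print(test_scaled)  # test values may fall outside [0, 1]
```

Because the minimum and range come from the training data alone, test values that lie outside the training range are mapped outside the interval from 0 to 1.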
In [ ]:
Image(filename='images/04_02.png', width=300)
This time the fit method of an estimator, for example LogisticRegression, is used with training data and training labels to learn a model.
This model is then used to predict outcomes for the test data. The output is a set of predicted class labels.
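The fit/predict pattern can be sketched in the same spirit. NearestCentroidSketch is a hypothetical toy classifier written for this illustration (it is not LogisticRegression); it only shows that fit consumes data plus labels and predict returns labels.

```python
import numpy as np


class NearestCentroidSketch:
    """Hypothetical toy classifier that mimics the fit/predict pattern."""

    def fit(self, X, y):
        # learn one centroid (mean vector) per class from the training data
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0)
                                    for c in self.classes_])
        return self

    def predict(self, X):
        # assign each sample the label of its closest centroid
        dists = np.linalg.norm(X[:, np.newaxis, :] - self.centroids_, axis=2)
        return self.classes_[np.argmin(dists, axis=1)]


X_fit_demo = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_fit_demo = np.array([0, 0, 1, 1])
X_new_demo = np.array([[0.1, 0.0], [1.0, 0.9]])

clf = NearestCentroidSketch().fit(X_fit_demo, y_fit_demo)
y_pred_demo = clf.predict(X_new_demo)
print(y_pred_demo)
```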
In [ ]:
import pandas as pd
df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
['red', 'L', 13.5, 'class2'],
['blue', 'XL', 15.3, 'class1']])
df.columns = ['color', 'size', 'price', 'classlabel']
df
In [ ]:
size_mapping = {'XL': 3,
'L': 2,
'M': 1}
df['size'] = df['size'].map(size_mapping)
df
We can map these back to the original strings if required.
In [ ]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)
In [ ]:
import numpy as np
# create a mapping dict
# to convert class labels from strings to integers
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['classlabel']))}
class_mapping
In [ ]:
# class_mapping is the dictionary we pass to the map method
# to convert class labels from strings to integers
df['classlabel'] = df['classlabel'].map(class_mapping)
df
In [ ]:
# reverse the class label mapping
inv_class_mapping = {v: k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df
In [ ]:
# To avoid doing this by hand, we can use the LabelEncoder class
# from sklearn.preprocessing
from sklearn.preprocessing import LabelEncoder
# Label encoding with sklearn's LabelEncoder
class_le = LabelEncoder()
y = class_le.fit_transform(df['classlabel'].values)
y
In [ ]:
# reverse mapping
class_le.inverse_transform(y)
In [ ]:
# Just looking at color, size & price we can convert non numeric data with
# the LabelEncoder
X = df[['color', 'size', 'price']].values
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X
What is the problem with this approach?
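One answer to the question above: the integer codes impose an order on a nominal feature. A learning algorithm would treat red (2) as larger than green (1), which in turn is larger than blue (0), although colors have no such ranking. As a minimal NumPy sketch (variable names are illustrative), one-hot encoding avoids this by giving each color its own binary column:

```python
import numpy as np

colors = np.array(['green', 'red', 'blue'])

# integer codes in alphabetical category order, as a label encoder assigns them
categories, codes = np.unique(colors, return_inverse=True)
print(categories)
print(codes)

# one row per sample, one column per category
one_hot = np.eye(len(categories))[codes]
print(one_hot)
```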
In [ ]:
from sklearn.preprocessing import OneHotEncoder
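# note: the categorical_features parameter was removed in scikit-learn 0.22,
# so this cell requires an older version
ohe = OneHotEncoder(categorical_features=[0])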
ohe.fit_transform(X).toarray()
In [ ]:
# return dense array so that we can skip
# the toarray step
ohe = OneHotEncoder(categorical_features=[0], sparse=False)
ohe.fit_transform(X)
In [ ]:
df
In [ ]:
# one-hot encoding via pandas - just color as a nominal value
pd.get_dummies(df[['price', 'color', 'size']])
In [ ]:
# one-hot encoding via pandas - both color and class label as nominal values
pd.get_dummies(df[['price', 'color', 'size','classlabel']])
In [ ]:
# multicollinearity guard in get_dummies
pd.get_dummies(df[['price', 'color', 'size']], drop_first=True)
In [ ]:
# multicollinearity guard in get_dummies
# - both color and class label as nominal values
pd.get_dummies(df[['price', 'color', 'size','classlabel']], drop_first=True)
In [ ]:
X
In [ ]:
# multicollinearity guard for the OneHotEncoder
ohe = OneHotEncoder(categorical_features=[0])
ohe.fit_transform(X).toarray()
In [ ]:
ohe.fit_transform(X).toarray()[:, 1:]
In [ ]:
df_wine = pd.read_csv('https://archive.ics.uci.edu/'
'ml/machine-learning-databases/wine/wine.data',
header=None)
# if the Wine dataset is temporarily unavailable from the
# UCI machine learning repository, un-comment the following line
# of code to load the dataset from a local path:
# df_wine = pd.read_csv('wine.data', header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
'Alcalinity of ash', 'Magnesium', 'Total phenols',
'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
'Proline']
print('Class labels', np.unique(df_wine['Class label']))
df_wine.head()
In [ ]:
from sklearn.model_selection import train_test_split
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test =\
train_test_split(X, y,
test_size=0.3,
random_state=0,
stratify=y)
# X data
# y class label that will be used to train
# Test size 0.3 = 30% test data, the rest training data
In [ ]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
In [ ]:
X[0,:]
In [ ]:
X_train[0,:]
In [ ]:
X_train_norm[0,:]
In [ ]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
In [ ]:
X_train_std[0,:]
A worked example:
In [ ]:
ex = np.array([0, 1, 2, 3, 4, 5])
print('standardized:', (ex - ex.mean()) / ex.std())
# Please note that pandas uses ddof=1 (sample standard deviation)
# by default, whereas NumPy's std method and the StandardScaler
# uses ddof=0 (population standard deviation)
# normalize
print('normalized:', (ex - ex.min()) / (ex.max() - ex.min()))
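The note about ddof in the cell above can be checked directly. This is a small sketch using the same example array; the variable names are illustrative.

```python
import numpy as np

ex = np.array([0, 1, 2, 3, 4, 5])

pop_std = ex.std()            # ddof=0: population standard deviation
sample_std = ex.std(ddof=1)   # ddof=1: sample standard deviation
print(pop_std, sample_std)
```

The sample standard deviation divides by n - 1 instead of n, so it is always slightly larger; standardizing with one or the other gives slightly different values.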
If a model performs much better on the training dataset than on the test dataset, it is very likely overfitting.
Overfitting means the model does not generalise well, so it does not work well on unseen data.
Some options to deal with this are collecting more training data, choosing a simpler model with fewer parameters, penalising complexity via regularisation, and reducing the dimensionality of the data.
Collecting more training data may not be an option, and trying simpler models with fewer parameters may come down to trial and error.
Next we will look at penalising complexity via regularisation, then at dimensionality reduction via feature selection.
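As a short reminder of what the two common penalty terms look like, here they are written out. The notation is an assumption made for this note and does not appear in the notebook: $\mathbf{w}$ is the weight vector, $m$ the number of features, $J$ the unregularised cost and $\lambda$ the regularisation strength.

$$
\text{L2:} \quad \|\mathbf{w}\|_2^2 = \sum_{j=1}^{m} w_j^2
\qquad\qquad
\text{L1:} \quad \|\mathbf{w}\|_1 = \sum_{j=1}^{m} |w_j|
$$

$$
J_{\text{regularised}}(\mathbf{w}) = J(\mathbf{w}) + \lambda \, \|\mathbf{w}\|_1
$$

In scikit-learn's LogisticRegression, the parameter C plays the role of the inverse of the regularisation strength, so a smaller C means a stronger penalty.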
In [ ]:
Image(filename='images/04_04.png', width=500)
In [ ]:
Image(filename='images/04_05.png', width=500)
In [ ]:
Image(filename='images/04_06.png', width=500)
For regularized models in scikit-learn that support L1 regularization, we can simply set the penalty parameter to 'l1' to obtain a sparse solution:
In [ ]:
from sklearn.linear_model import LogisticRegression
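# the liblinear solver supports the L1 penalty; the default solver in
# scikit-learn 0.22 and later (lbfgs) does not
LogisticRegression(penalty='l1', solver='liblinear')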
Applied to the standardized Wine data ...
In [ ]:
from sklearn.linear_model import LogisticRegression
# solver='liblinear' is set explicitly because it supports the L1 penalty
lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy :', lr.score(X_test_std, y_test))
In [ ]:
lr.intercept_
This shows the intercept of each of the three models being used.
In [ ]:
# A numpy function to set precision
np.set_printoptions(8)
Here we can see how many weights have not been driven to zero by L1 regularization, out of a maximum of 39.
$(13 \text{ dimensions} \times 3 \text{ classes})$
In [ ]:
lr.coef_[lr.coef_!=0].shape
Here we can see all the weights for the three classes and the 13 dimensions in the wine dataset.
In [ ]:
lr.coef_
With this information we can now graph how the regularization strength affects the weights.
The default value of C, the inverse of the regularization strength, in LogisticRegression is 1.0. We can use a simple loop to go from $10^{-4}$ to $10^5$ to get the weights and then graph them.
for c in np.arange(-4., 6.):
    lr = LogisticRegression(penalty='l1', C=10.**c, solver='liblinear', random_state=0)
In [ ]:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'cyan',
'magenta', 'yellow', 'black',
'pink', 'lightgreen', 'lightblue',
'gray', 'indigo', 'orange']
weights, params = [], []
# fit one model per value of C and record the weights of the second class
for c in np.arange(-4., 6.):
    lr = LogisticRegression(penalty='l1', C=10.**c,
                            solver='liblinear', random_state=0)
    lr.fit(X_train_std, y_train)
    weights.append(lr.coef_[1])
    params.append(10**c)
weights = np.array(weights)
# one line per feature, showing its weight as a function of C
for column, color in zip(range(weights.shape[1]), colors):
    plt.plot(params, weights[:, column],
             label=df_wine.columns[column + 1],
             color=color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('weight coefficient')
plt.xlabel('C')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center',
bbox_to_anchor=(1.38, 1.03),
ncol=1, fancybox=True)
#plt.savefig('images/04_07.png', dpi=300,
# bbox_inches='tight', pad_inches=0.2)
plt.show()
Below we can see the effect on the training and test accuracy.
In [ ]:
lr = LogisticRegression(penalty='l1', C=0.01, solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy :', lr.score(X_test_std, y_test))
In [ ]:
lr = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy :', lr.score(X_test_std, y_test))
In [ ]:
lr = LogisticRegression(penalty='l1', C=1, solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy :', lr.score(X_test_std, y_test))
In [ ]:
lr = LogisticRegression(penalty='l1', C=10, solver='liblinear')
lr.fit(X_train_std, y_train)
print('Training accuracy:', lr.score(X_train_std, y_train))
print('Test accuracy :', lr.score(X_test_std, y_test))
In [ ]:
from sklearn.base import clone
from itertools import combinations
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
class SBS():
    def __init__(self, estimator, k_features, scoring=accuracy_score,
                 test_size=0.25, random_state=1):
        self.scoring = scoring
        self.estimator = clone(estimator)
        self.k_features = k_features
        self.test_size = test_size
        self.random_state = random_state
    def fit(self, X, y):
        X_train, X_test, y_train, y_test = \
            train_test_split(X, y, test_size=self.test_size,
                             random_state=self.random_state)
        dim = X_train.shape[1]
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X_train, y_train,
                                 X_test, y_test, self.indices_)
        self.scores_ = [score]
        while dim > self.k_features:
            scores = []
            subsets = []
            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X_train, y_train,
                                         X_test, y_test, p)
                scores.append(score)
                subsets.append(p)
            best = np.argmax(scores)
            self.indices_ = subsets[best]
            self.subsets_.append(self.indices_)
            dim -= 1
            self.scores_.append(scores[best])
        self.k_score_ = self.scores_[-1]
        return self
    def transform(self, X):
        return X[:, self.indices_]
    def _calc_score(self, X_train, y_train, X_test, y_test, indices):
        self.estimator.fit(X_train[:, indices], y_train)
        y_pred = self.estimator.predict(X_test[:, indices])
        score = self.scoring(y_test, y_pred)
        return score
In [ ]:
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
# selecting features
sbs = SBS(knn, k_features=1)
sbs.fit(X_train_std, y_train)
# plotting performance of feature subsets
k_feat = [len(k) for k in sbs.subsets_]
plt.plot(k_feat, sbs.scores_, marker='o')
plt.ylim([0.7, 1.02])
plt.ylabel('Accuracy')
plt.xlabel('Number of features')
plt.grid()
plt.tight_layout()
# plt.savefig('images/04_08.png', dpi=300)
plt.show()
In [ ]:
k3 = list(sbs.subsets_[10])
print(df_wine.columns[1:][k3])
In [ ]:
k6 = list(sbs.subsets_[7])
print(df_wine.columns[1:][k6])
In [ ]:
knn.fit(X_train_std, y_train)
print('Training accuracy: %0.3f' % knn.score(X_train_std, y_train))
print('Test accuracy : %0.3f' % knn.score(X_test_std, y_test))
In [ ]:
knn.fit(X_train_std[:, k3], y_train)
print('Training accuracy: %0.3f' % knn.score(X_train_std[:, k3], y_train))
print('Test accuracy : %0.3f' % knn.score(X_test_std[:, k3], y_test))
In [ ]:
knn.fit(X_train_std[:, k6], y_train)
print('Training accuracy: %0.3f' % knn.score(X_train_std[:, k6], y_train))
print('Test accuracy : %0.3f' % knn.score(X_test_std[:, k6], y_test))
In [ ]:
from sklearn.ensemble import RandomForestClassifier
feat_labels = df_wine.columns[1:]
forest = RandomForestClassifier(n_estimators=500,
random_state=1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]),
importances[indices],
align='center')
plt.xticks(range(X_train.shape[1]),
feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
#plt.savefig('images/04_09.png', dpi=300)
plt.show()
This is great for finding discriminative features, with one gotcha: if two or more features are highly correlated, one feature may be ranked very highly while the information in the other feature(s) may not be fully captured. This is not a problem if model performance is the main concern, but it matters if the goal is to interpret feature importance.
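One way to spot this situation before interpreting the importances is to look at pairwise feature correlations. The following is a minimal sketch on synthetic data; the 0.9 cutoff and all variable names are assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np

rng = np.random.RandomState(0)
a = rng.normal(size=200)
b = 2.0 * a + rng.normal(scale=0.01, size=200)   # nearly a scaled copy of a
c = rng.normal(size=200)                         # independent feature
X_corr_demo = np.column_stack([a, b, c])

# absolute Pearson correlation between every pair of columns
corr = np.abs(np.corrcoef(X_corr_demo, rowvar=False))

# report pairs above the (assumed) cutoff, upper triangle only
cutoff = 0.9
rows, cols = np.where(np.triu(corr, k=1) > cutoff)
correlated_pairs = list(zip(rows.tolist(), cols.tolist()))
print(correlated_pairs)
```

Pairs reported here are candidates for sharing importance between them, so their individual rankings deserve extra caution.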
Scikit-learn implements a SelectFromModel object that selects features based on a user-specified threshold after model fitting. Note that the forest fitted above is passed in. Here we set a threshold of 0.1, which keeps the top 5 features.
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
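Conceptually, a numeric threshold keeps the features whose importance is greater than or equal to it. Here is a minimal NumPy sketch of that idea, with made-up importance values (they are not results from the forest above) and illustrative variable names:

```python
import numpy as np

# made-up importances for six features; not results from the forest above
importances_demo = np.array([0.30, 0.25, 0.20, 0.12, 0.08, 0.05])
X_thr_demo = np.arange(12, dtype=float).reshape(2, 6)   # 2 samples, 6 features

threshold = 0.1
mask = importances_demo >= threshold   # keep features at or above the threshold
X_thr_selected = X_thr_demo[:, mask]

print(mask.sum(), 'of', mask.size, 'features kept')
print(X_thr_selected.shape)
```

Selection only drops columns; the number of rows (samples) stays the same.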
In [ ]:
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_selected = sfm.transform(X_train)
# feature selection keeps all samples, so compare the number of columns
print('Number of features that meet this threshold criterion: %d out of %d' % (X_selected.shape[1], X_train.shape[1]))
In [ ]:
for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
See chapter 5 for feature extraction.